{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Model Prediction Verification"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This script demonstrates how to train a single model class, embed the model, and solve the optimization problem for *classification* models. We fix a sample from our generated data and solve the optimization problem with all elements of $\\mathbf{x}$ equal to our data. In general, we might have some elements of $\\mathbf{x}$ that are fixed, called our \"conceptual variables,\" and the remaining indices are our decision variables. By fixing all elements of $\\mathbf{x}$, we can verify that the model prediction matches the original sklearn model."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In general, we assume that we are interested in constraining the output probability of a binary classification model to be above a lower bound (or below an upper bound, in the case of a negative outcome). Many binary classification models perform a nonlinear sigmoid transformation to convert the output to a probability. In this case, we must transform the desired threshold to allow us to maintain linear constraints. Furthermore, support vector classification returns a class label, rather than a probability. \n",
    "\n",
    "The handling of each classifcation model is described below:\n",
    "- **CART, ODT, and RF**: These models all return a probability that can be directly lower bounded by a threshold t. The constraint is of the form y ≥ t.\n",
    "- **Logistic Regression (\"Linear\"), GBM, MLP**: The embedding of these models returns a value y that can be constrained by the inequality: sigmoid(y) ≥ t. For a fixed t, we can represent this as a linear constraint, y ≥ -ln(1/t - 1). \n",
    "- **SVC**: SVC returns a binary label, rather than a probability. We assume that the user wants to constrain the output to fall in class 1 (rather than class 0) of the binary labelled data. The classifier will predict 1 if the linear decision function is at least 0. We embed y as the decision function output and add the constraint y ≥ 0."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load the relevant packages"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import math\n",
    "from sklearn.utils.extmath import cartesian\n",
    "import time\n",
    "import sys\n",
    "import os\n",
    "import time\n",
    "\n",
    "from sklearn.metrics import roc_auc_score, r2_score, mean_squared_error\n",
    "from sklearn.cluster import KMeans"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from importlib import reload"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "import opticl\n",
    "from pyomo import environ\n",
    "from pyomo.environ import *"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Initialize data\n",
    "We will work with a basic dataset from `sklearn`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.datasets import make_classification\n",
    "from sklearn.model_selection import train_test_split\n",
    "X, y = make_classification(n_samples=200, n_features = 20, random_state=1)\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y,\n",
    "                                                    random_state=1)\n",
    "X_train = pd.DataFrame(X_train).add_prefix('col')\n",
    "X_test = pd.DataFrame(X_test).add_prefix('col')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Train the chosen model type"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "alg = 'linear' \n",
    "task_type = 'binary'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The user can optionally select a manual parameter grid for the cross-validation procedure. We implement a default parameter grid; see **run_MLmodels.py** for details on the tuned parameters. If you wish to use the default, leave ```parameter_grid = None``` (or do not specify any grid)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "parameter_grid = None\n",
    "# parameter_grid = {'hidden_layer_sizes': [(5,),(10,)]} # mlp example"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "------------- Initialize grid  ----------------\n",
      "------------- Running model  ----------------\n",
      "Algorithm = linear, metric = None\n",
      "saving... results/linear_temp_trained.pkl\n",
      "------------- Model evaluation  ----------------\n",
      "-------------------training evaluation-----------------------\n",
      "Train Score: 0.9539285714285715\n",
      "-------------------testing evaluation-----------------------\n",
      "Test Score: 0.8686371100164203\n"
     ]
    }
   ],
   "source": [
    "s = 1\n",
    "version = 'test'\n",
    "outcome = 'temp'\n",
    "\n",
    "model_save = 'results/%s/%s_%s_model.csv' % (alg, version, outcome)\n",
    "\n",
    "\n",
    "alg_run = alg if alg != 'rf' else 'rf_shallow'\n",
    "m, perf = opticl.run_model(X_train, y_train, X_test, y_test, alg_run, outcome, task = task_type, \n",
    "                       seed = s, cv_folds = 5, \n",
    "                       # The user can manually specify the parameter grid for cross-validation if desired\n",
    "                       parameter_grid = parameter_grid,\n",
    "                       save_path = model_save,\n",
    "                       save = False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After training the model, we will save the trained model in the format needed for embedding the constraints. See **constraint_learning.py** for the specific format that is extracted per method. We also save the performance of the model to use in the automated model selection pipeline (if desired).\n",
    "\n",
    "We also create the save directory if it does not exist.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "if not os.path.exists('results/%s/' % alg):\n",
    "    os.makedirs('results/%s/' % alg)\n",
    "    \n",
    "constraintL = opticl.ConstraintLearning(X_train, y_train, m, alg)\n",
    "constraint_add = constraintL.constraint_extrapolation(task_type)\n",
    "constraint_add.to_csv(model_save, index = False)\n",
    "# \n",
    "perf.to_csv('results/%s/%s_%s_performance.csv' % (alg, version, outcome), index= False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Check: what should the result be for our sample observation, if all x are fixed?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Choose sample to test\n",
    "This will be the observation (\"patient\") that we feed into the optimization model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "sample_id = 0\n",
    "sample = X_train.loc[sample_id:sample_id,:].reset_index(drop = True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Calculate model prediction directly in sklearn."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[0.4202077, 0.5797923]])"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "m.predict_proba(sample)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Optimization formulation\n",
    "We will embed the model trained above. The model could also be selected using the model selection pipeline, which we demonstrate in the WFP example script.\n",
    "\n",
    "If manually specifying the model, as we are here, the key elements of the ``model_master`` dataframe are:\n",
    "- model_type: algorithm name.\n",
    "- outcome: name of outcome of interest; this is relevant in the case of multiple learned outcomes.\n",
    "- save_path: file name of the extracted model.\n",
    "- objective: the weight of the objective if it should be included as an additive term in the objective. A weight of 0 omits it from the objective entirely.\n",
    "- lb/ub: the lower (or upper) bound that we wish to apply to the learned outcome. If there is no bound, it should be set to ``None``.\n",
    "\n",
    "In this case, we set the outcome to be our only objective term, which will allow us to verify that the predictions are consistent between the embedded model and the sklearn prediction function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "model_master = pd.DataFrame(columns = ['model_type','outcome','save_path','lb','ub','objective'])\n",
    "\n",
    "model_master.loc[0,'model_type'] = alg\n",
    "model_master.loc[0,'save_path'] = 'results/%s/%s_%s_model.csv' % (alg, version, outcome)\n",
    "model_master.loc[0,'outcome'] = outcome\n",
    "model_master.loc[0,'objective'] = 1\n",
    "model_master.loc[0,'ub'] = None\n",
    "model_master.loc[0,'lb'] = None\n",
    "model_master['SCM_counterfactuals'] = None\n",
    "model_master['features'] = [[col for col in X_train.columns]]\n",
    "model_master.loc[0,'task'] = task_type"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Pyomo"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "model_pyo = ConcreteModel()\n",
    "\n",
    "## We will create our x decision variables, and fix them all to our sample's values for model verification.\n",
    "N = X_train.columns\n",
    "model_pyo.x = Var(N, domain=Reals)\n",
    "\n",
    "def fix_value(model_pyo, index):\n",
    "    return model_pyo.x[index] == sample.loc[0,index]\n",
    "\n",
    "model_pyo.Constraint1 = Constraint(N, rule=fix_value)\n",
    "\n",
    "## Specify any non-learned objective components - none here \n",
    "model_pyo.OBJ = Objective(expr=0, sense=minimize)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Embedding objective function for temp\n"
     ]
    }
   ],
   "source": [
    "final_model_pyo = opticl.optimization_MIP(model_pyo, model_pyo.x, model_master, X_train, tr = False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'Problem': [{'Name': 'x22', 'Lower bound': 0.32192080350292496, 'Upper bound': 0.32192080350292496, 'Number of objectives': 1, 'Number of constraints': 22, 'Number of variables': 22, 'Number of binary variables': 0, 'Number of integer variables': 0, 'Number of continuous variables': 22, 'Number of nonzeros': 23, 'Sense': 'minimize'}], 'Solver': [{'Status': 'ok', 'Return code': '0', 'Message': 'Model was solved to optimality (subject to tolerances), and an optimal solution is available.', 'Termination condition': 'optimal', 'Termination message': 'Model was solved to optimality (subject to tolerances), and an optimal solution is available.', 'Wall time': '0.0', 'Error rc': 0, 'Time': 0.10682272911071777}], 'Solution': [OrderedDict([('number of solutions', 0), ('number of solutions displayed', 0)])]}"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "opt = SolverFactory('gurobi')\n",
    "results = opt.solve(final_model_pyo) \n",
    "results"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Check for equality between sklearn and embedded models"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "def sigmoid(x):\n",
    "    return 1 / (1 + math.exp(-x))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "True outcome: 0.5797923\n",
      "Pyomo output: 0.5797923\n"
     ]
    }
   ],
   "source": [
    "if alg in ['cart','iai','iai-single','rf','mlp','gbm','linear']:\n",
    "    true_outcome = np.array(m.predict_proba(sample))[0,1]\n",
    "elif alg == 'svm':\n",
    "    true_outcome = m.decision_function(sample)\n",
    "\n",
    "if alg in ['cart','iai','iai-single','rf','svm']:\n",
    "    pyo_outcome = final_model_pyo.OBJ()\n",
    "elif alg in ['mlp','gbm','linear']:\n",
    "    pyo_outcome = sigmoid(final_model_pyo.OBJ())\n",
    "\n",
    "\n",
    "print(\"True outcome: %.7f\" % true_outcome)\n",
    "print(\"Pyomo output: %.7f\" % pyo_outcome)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "@webio": {
   "lastCommId": null,
   "lastKernelId": null
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
